Objective

Develop a model that quantifies the influence of cover, substrate, depth, and velocity on Chinook salmon presence and absence.

Analysis Overview

The model was developed using the Feather River Mini Snorkel data (filtered to Chinook salmon), which consist of numeric fish count observations that can also be expressed as a binary presence–absence response. Because the count data were highly zero-inflated, with many observations in which no fish were detected, we initially evaluated a hurdle modeling approach following the framework described in Gard (2024, in review). Hurdle models are well suited for datasets dominated by absences, as they separately model the processes governing occurrence and abundance. In this case, the hurdle model showed reasonable performance for the presence–absence (zero) component but performed poorly for the count component, indicating that fish abundance could not be reliably predicted from the available covariates.

Given these limitations, subsequent analyses focused on modeling fish occurrence using logistic regression. Data were collected between March and August, a period that spans seasonal changes in habitat conditions and salmonid outmigration, resulting in strong temporal variation in fish presence. In addition, habitat conditions differ substantially between high-flow and low-flow channels. Initial exploratory analyses considered fitting separate models for each channel type; however, we ultimately adopted a unified logistic regression framework that allowed these differences to be incorporated through random effects while maintaining a single, interpretable model structure.

To account for spatial and temporal structure in the data, we evaluated a series of mixed-effects logistic regression models. Random intercepts were used to capture variation among transect sites, repeated sampling within sites, and seasonal differences among months. Among the models evaluated, the best-performing model included random effects for both site and month, reflecting the importance of spatial heterogeneity and seasonal dynamics in shaping fish occurrence. Including channel type as an additional random effect did not further improve model performance. This final model provided the strongest overall performance and served as the basis for inference on habitat associations with fish presence.

Review Data

## Rows: 17,338
## Columns: 44
## $ micro_hab_data_tbl_id                       <dbl> 18, 18, 19, 20, 21, 23, 24…
## $ location_table_id                           <dbl> 11, 11, 11, 11, 11, 11, 11…
## $ transect_code                               <dbl> 0.1, 0.1, 0.2, 0.3, 0.4, 3…
## $ fish_data_id                                <dbl> 21, 22, NA, NA, NA, NA, NA…
## $ date                                        <date> 2001-03-14, 2001-03-14, 2…
## $ count                                       <dbl> 2, 3, 0, 0, 0, 0, 0, 1, 25…
## $ species                                     <chr> "chinook salmon", "chinook…
## $ fl_mm                                       <dbl> 35, 35, NA, NA, NA, NA, NA…
## $ dist_to_bottom                              <dbl> 1.0, 1.5, NA, NA, NA, NA, …
## $ depth                                       <dbl> 17, 17, 19, 11, 12, 10, 8,…
## $ focal_velocity                              <dbl> 0.94, 0.16, NA, NA, NA, NA…
## $ velocity                                    <dbl> 0.22, 0.22, 0.35, 1.95, 2.…
## $ surface_turbidity                           <dbl> 20, 20, 30, 30, 30, 10, 10…
## $ percent_fine_substrate                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_sand_substrate                      <dbl> 40, 40, 50, 25, 0, 30, 0, …
## $ percent_small_gravel_substrate              <dbl> 20, 20, 40, 75, 80, 50, 60…
## $ percent_large_gravel_substrate              <dbl> 30, 30, 10, 0, 20, 20, 40,…
## $ percent_cobble_substrate                    <dbl> 10, 10, 0, 0, 0, 0, 0, 0, …
## $ percent_boulder_substrate                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_no_cover_inchannel                  <dbl> 75, 75, 100, 100, 100, 100…
## $ percent_small_woody_cover_inchannel         <dbl> 15, 15, 0, 0, 0, 0, 0, 0, …
## $ percent_large_woody_cover_inchannel         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_submerged_aquatic_veg_inchannel     <dbl> 10, 10, 0, 0, 0, 0, 0, 0, …
## $ percent_undercut_bank                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_no_cover_overhead                   <dbl> 100, 100, 100, 100, 100, 1…
## $ percent_cover_half_meter_overhead           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_cover_more_than_half_meter_overhead <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ channel_geomorphic_unit                     <chr> "glide", "glide", "glide",…
## $ location                                    <chr> "hatchery ditch", "hatcher…
## $ channel_location                            <chr> "LFC", "LFC", "LFC", "LFC"…
## $ water_temp                                  <dbl> 47, 47, 47, 47, 47, 47, 47…
## $ weather                                     <chr> "direct sunlight", "direct…
## $ flow                                        <dbl> 12, 12, 12, 12, 12, 12, 12…
## $ number_of_divers                            <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ reach_length                                <dbl> 25, 25, 25, 25, 25, 25, 25…
## $ reach_width                                 <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ channel_width                               <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7,…
## $ channel_type                                <chr> "sidechannel", "sidechanne…
## $ river_mile                                  <dbl> 66.6, 66.6, 66.6, 66.6, 66…
## $ coordinate_method                           <chr> "assigned based on similar…
## $ latitude                                    <dbl> 39.51602, 39.51602, 39.516…
## $ longitude                                   <dbl> -121.5588, -121.5588, -121…
## $ fish_presence                               <fct> 1, 1, 0, 0, 0, 0, 0, 1, 1,…
## $ month                                       <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3,…

Outliers

Count outliers exist in both the high flow and low flow channels; however, removing them did not change the model results, so they were kept in the dataset.

We expected that removing counts greater than 250 would reduce overdispersion in the count model, but it did not.

High Flow vs. Low Flow Channel

Table 1 and Figure 2 explore whether fish presence differed between the high flow and low flow channels. Overall, more salmon were observed in the low flow channel than in the high flow channel (Table 1). More fish are present in the high flow channel in March, but they move quickly downstream. Fish remain in the low flow channel for much longer (Figure 2).

Table 1. Total count of chinook salmon between high flow and low flow channels

channel_location      n
HFC                8887
LFC               14946

Table 2. Number of sampling sites in the high flow and low flow channels

channel_location  n_sites
HFC                    16
LFC                    13

Redd Data Integration and Exploration

Feather River redd survey data were sourced from the Environmental Data Initiative (EDI) and processed for integration with the Mini Snorkel habitat dataset. Locations with zero observed redds were removed, and remaining redd points were spatially joined to the nearest Mini Snorkel transect locations to ensure consistent site alignment. Redd information was then summarized at the transect scale, including both the total number of redds and a binary presence–absence indicator for spawning activity.

A key caveat is that the temporal coverage of the redd surveys (2014–2023) does not overlap with the Mini Snorkel observations (2001–2002). As a result, this exploration reflects an aggregate summary across the full redd dataset rather than a direct year-matched comparison. Rather than selecting a single “representative year” or subset of years, we chose to incorporate the entire redd record. This decision was made because no clear representative year could be identified, and summarizing across all available years allowed us to remove the temporal component while still capturing the overall spatial pattern of spawning activity when combining redd data with the Mini Snorkel habitat observations.

Redd data over time

This figure shows the number of redds over time at each location. It provides context on which sites generally have redds and whether they have them consistently over time. Qualitatively, redd counts appear to increase over the survey period.

Total number of redds by year

year total_redds
2014 1916
2015 2361
2016 1570
2017 2722
2018 4169
2019 5044
2020 5431
2021 2594
2022 3761
2023 7323

Combine redd data with the Mini Snorkel data to see whether redds are a spatial indicator of spawning potential/habitat quality.

Find the nearest transect location from the Mini Snorkel data to each redd

Redds are joined to the nearest Mini Snorkel transect location within 50 meters. The following histogram shows the distribution of redd distances; the 50-meter cutoff (shown as the red dashed line) was chosen because the histogram counts are highest within that distance.
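The join described above can be sketched as follows. This is a minimal Python illustration, not the code used for this report: the function names, the haversine distance, and the coordinates are all assumptions made for the example.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_transect(redd, transects, max_dist_m=50.0):
    """Return the name of the closest transect within max_dist_m, else None."""
    best_name, best_dist = None, float("inf")
    for name, (lat, lon) in transects.items():
        d = haversine_m(redd[0], redd[1], lat, lon)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= max_dist_m else None


# Hypothetical coordinates, for illustration only.
transects = {
    "site_a": (39.51602, -121.55880),
    "site_b": (39.52000, -121.56000),
}
redds = [
    (39.51620, -121.55890),  # roughly 20 m from site_a
    (39.53000, -121.58000),  # far from both sites, so it is not joined
]

assigned = [nearest_transect(r, transects) for r in redds]

# Summarize at the transect scale: total redds and a 0/1 presence indicator.
redd_total = Counter(a for a in assigned if a is not None)
redd_presence = {name: int(redd_total[name] > 0) for name in transects}
```

Redds farther than the cutoff from every transect are left unassigned, which is why the choice of cutoff affects the transect-level totals.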

A visual representation of the number of redds joined to each of the Mini Snorkel transect locations:

Outmigration Analysis

Goal

Understand timing and patterns of juvenile outmigration on the Feather River and compare to timing and density of fish observations in the mini snorkel dataset.

Insights

  • The majority of the catch (~80%) in rotary screw traps (RSTs) on the Feather River (both LFC and HFC) has passed through by March
  • The cumulative catch curves differ only slightly between the HFC and LFC, indicating that outmigration is not affecting these sites differently
  • The fish that remain in the Feather River after March and into May and June are likely larger (> 50 mm), which aligns with the fork length distributions by month in the habitat data

Variables of Interest

The variables of interest include cover, substrate, velocity and depth variables known to be important for salmon rearing habitat. We are also including the number of redds found near each mini snorkel transect and whether or not there were redds nearby.

  • Velocity - numeric
  • Depth - numeric
  • Number of Redds - numeric
  • Redd at location - 0/1
  • Month - categorical (3-8)

Substrate and cover variables, which are measured as percentages, are converted to binary presence/absence (1/0) using a threshold of 20%. Overhanging vegetation was measured at 1/2 meter overhead and at more than 1/2 meter overhead. These categories were combined for simplicity and for comparison with other studies, such as those by Mark Gard.

  • Undercut bank - 0/1
  • Aquatic vegetation - 0/1
  • Overhanging vegetation - 0/1
  • Small woody cover - 0/1
  • Large woody cover - 0/1
  • Boulder substrate - 0/1
  • Cobble substrate - 0/1
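A minimal Python sketch of the thresholding step, using the first record from the data preview above. Whether the comparison is inclusive (>= 20) and whether the two overhead classes are combined by summing are assumptions made for this example; the text does not specify either.

```python
THRESHOLD_PCT = 20  # threshold stated in the text


def to_presence(percent, threshold=THRESHOLD_PCT):
    """Return 1 if the percentage meets the threshold, else 0 (>= is an assumption)."""
    return int(percent >= threshold)


# Values from the first record shown in the data preview.
row = {
    "percent_small_woody_cover_inchannel": 15,
    "percent_submerged_aquatic_veg_inchannel": 10,
    "percent_cover_half_meter_overhead": 0,
    "percent_cover_more_than_half_meter_overhead": 0,
}

small_woody = to_presence(row["percent_small_woody_cover_inchannel"])
aquatic_veg = to_presence(row["percent_submerged_aquatic_veg_inchannel"])

# Combining the two overhead classes by summing them is an assumption.
overhanging_veg = to_presence(
    row["percent_cover_half_meter_overhead"]
    + row["percent_cover_more_than_half_meter_overhead"]
)
```

For this record all three indicators are 0, since 15% small woody cover and 10% aquatic vegetation both fall below the 20% threshold.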

Build Model Data

All cover variables were converted to presence/absence using a threshold of 20%. The following is the data structure of the model input data.

## Rows: 7,935
## Columns: 29
## $ count                                       <dbl> 2, 3, 0, 0, 0, 0, 0, 1, 25…
## $ location                                    <chr> "hatchery ditch", "hatcher…
## $ channel_location                            <chr> "LFC", "LFC", "LFC", "LFC"…
## $ depth                                       <dbl> 17, 17, 19, 11, 12, 10, 8,…
## $ velocity                                    <dbl> 0.22, 0.22, 0.35, 1.95, 2.…
## $ percent_small_woody_cover_inchannel         <dbl> 15, 15, 0, 0, 0, 0, 0, 0, …
## $ percent_large_woody_cover_inchannel         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_submerged_aquatic_veg_inchannel     <dbl> 10, 10, 0, 0, 0, 0, 0, 0, …
## $ percent_cover_half_meter_overhead           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_cover_more_than_half_meter_overhead <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_cobble_substrate                    <dbl> 10, 10, 0, 0, 0, 0, 0, 0, …
## $ percent_boulder_substrate                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ percent_undercut_bank                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ month                                       <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ channel_geomorphic_unit                     <chr> "glide", "glide", "glide",…
## $ reach_length                                <dbl> 25, 25, 25, 25, 25, 25, 25…
## $ reach_width                                 <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ channel_type                                <chr> "sidechannel", "sidechanne…
## $ small_woody                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ large_woody                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ boulder_substrate                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ cobble_substrate                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ undercut_bank                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ aquatic_veg                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ overhanging_veg                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ redd_total                                  <dbl> 2269, 2269, 2269, 2269, 22…
## $ redd_mean                                   <dbl> 1.000441, 1.000441, 1.0004…
## $ redd_median                                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ redd_presence                               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Model Performance Evaluation Overview

Model performance was evaluated using a combination of receiver operating characteristic (ROC) analysis and confusion matrix–based classification metrics. The area under the ROC curve (AUC) was used to assess the model’s ability to discriminate between presence and absence across all possible probability thresholds. AUC provides a threshold-independent measure of performance, where values near 0.5 indicate no discrimination and higher values indicate increasing ability to correctly rank presences above absences.
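The ranking interpretation of AUC can be illustrated with a short Python sketch. The scores and labels are hypothetical, and this pairwise calculation is only one way to compute AUC; it is not necessarily how the values in this report were produced.

```python
def auc(scores, labels):
    """Share of presence/absence pairs in which the presence has the higher
    predicted probability (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and observed presence (1) / absence (0).
scores = [0.02, 0.05, 0.40, 0.30, 0.04, 0.60]
labels = [0, 0, 0, 1, 0, 1]

example_auc = auc(scores, labels)  # 7 of 8 pairs are ranked correctly
```

No probability cutoff appears anywhere in the calculation, which is why AUC can be high even when few observations are classified as presences at a fixed threshold.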

In addition, model predictions were converted to binary classifications using a fixed probability threshold, and a confusion matrix was used to summarize agreement between predicted and observed outcomes. The confusion matrix tabulates true positives, true negatives, false positives, and false negatives, allowing evaluation of classification behavior such as sensitivity to presences and specificity to absences. Because the dataset was imbalanced, with many more absences than presences, emphasis was placed on interpreting the confusion matrix in conjunction with AUC rather than relying on overall accuracy alone.

Confusion matrices have the following layout, where correct predictions are represented by val1 (true absences) and val4 (true presences). False positives are represented by val2: cases where fish presence was predicted but not observed. False negatives are represented by val3: cases where fish were observed but absence was predicted.

                        Observed Absence (0)  Observed Presence (1)
Predicted Absence (0)   val1                  val3
Predicted Presence (1)  val2                  val4

Together, these metrics provide complementary perspectives on model performance: AUC characterizes overall discriminatory ability independent of threshold choice, while the confusion matrix illustrates how predictions behave at a specific cutoff and highlights trade-offs between detecting presences and avoiding false positives.
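As a sketch of how the confusion matrix cells translate into classification metrics, the Python function below uses the val1 to val4 layout shown above. The function name is hypothetical; the counts are taken from the location and month model reported later in this document.

```python
def classification_metrics(val1, val2, val3, val4):
    """val1 = true absences, val2 = false positives,
    val3 = false negatives, val4 = true presences."""
    return {
        "sensitivity": val4 / (val4 + val3),  # share of observed presences detected
        "specificity": val1 / (val1 + val2),  # share of observed absences detected
        "precision": val4 / (val4 + val2),    # share of predicted presences that are correct
        "accuracy": (val1 + val4) / (val1 + val2 + val3 + val4),
    }


# Counts from the confusion matrix of the location + month model.
metrics = classification_metrics(val1=7649, val2=16, val3=232, val4=38)
```

With these counts, accuracy is about 0.97 while sensitivity is only about 0.14, which illustrates why overall accuracy is a poor summary for a dataset dominated by absences.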

Hurdle Model Approach

Hurdle Models and Interpretation

A hurdle model was used in Gard (2024, in review) to test for the effects of cover and habitat type on the total abundance of Chinook salmon at both the site and cell level. Here we explore the use of a hurdle model to help understand the influence of velocity, depth, and cover on fish count and presence/absence.

Hurdle Models

Hurdle models are used when count data have an excess of zeros. They treat the data as arising from two processes: a binary process that determines whether the count is zero or positive, and a count process, truncated at zero, that determines the size of the positive counts.

A hurdle model therefore models zeros separately from the rest of the data. The zero versus non-zero outcome is modeled as a binary response, and the positive counts are modeled with a zero-truncated count distribution such as the Poisson or, as in the model fitted below, the negative binomial.
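In symbols, one standard way to write this two-part structure is given below. The notation is introduced here for illustration and does not come from the report: \pi_i is the probability of a positive count for observation i, f is the count distribution (Poisson or negative binomial) with mean \mu_i and dispersion \theta, and \gamma and \beta are the coefficient vectors of the zero and count components.

```latex
\begin{aligned}
P(Y_i = 0) &= 1 - \pi_i, \\
P(Y_i = y) &= \pi_i \, \frac{f(y;\, \mu_i, \theta)}{1 - f(0;\, \mu_i, \theta)}, \qquad y = 1, 2, \dots \\
\operatorname{logit}(\pi_i) &= \mathbf{x}_i^\top \boldsymbol{\gamma}, \qquad
\log(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}
\end{aligned}
```

Dividing by 1 - f(0) rescales the count distribution so that it places probability only on positive counts.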

Interpreting a Hurdle Model

The binary part of the model helps identify factors that influence the presence/absence of fish. Coefficients of the zero part of the hurdle model are on the log-odds scale; exponentiating them gives the odds ratio of observing at least one fish.

The count part of the model estimates the effects of predictor variables on the count outcome, excluding all zero counts. Exponentiated count coefficients represent rate ratios for the number of fish observed, given that at least one fish was observed.

The Incidence Rate Ratio (IRR) in the count part of the model (count > 0) is the exponentiated coefficient and represents the multiplicative effect of a one-unit change in a predictor variable on the expected count of non-zero observations, assuming all other variables are held constant. For example, if the IRR for a predictor is 1.2, a one-unit increase in that predictor is associated with a 20% increase in the expected count of non-zero observations. For the binary part of the model, a coefficient of 0.5 corresponds to an odds ratio of exp(0.5) ≈ 1.65, meaning that a one-unit increase in the predictor is associated with roughly a 65% increase in the odds of a positive count versus a zero count, assuming all other variables are held constant.
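The conversion from a model coefficient to a ratio is a single exponentiation, as in the Python sketch below. The two coefficients are copied from the hurdle model summary reported later in this document; the variable and function names are illustrative.

```python
import math


def pct_change(coef):
    """Percent change in the odds (logit link) or expected count (log link)
    for a one-unit increase in the predictor."""
    return (math.exp(coef) - 1) * 100


# Coefficients copied from the hurdle model summary in this report.
depth_count = 0.0233575        # count component (log link)
small_woody_zero = 0.67250697  # zero hurdle component (logit link)

irr_depth = math.exp(depth_count)            # incidence rate ratio for depth
or_small_woody = math.exp(small_woody_zero)  # odds ratio for small woody cover

print(f"depth IRR = {irr_depth:.3f} ({pct_change(depth_count):.1f}% per unit of depth)")
print(f"small woody OR = {or_small_woody:.2f}")
```

Under this reading, each additional unit of depth is associated with an increase of roughly 2% in the expected non-zero count, and the presence of small woody cover roughly doubles the odds of observing at least one fish.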

Build Model

Hurdle Model Results Summary

## Start:  AIC=4558.18
## count ~ small_woody + depth + velocity + large_woody + aquatic_veg + 
##     overhanging_veg + cobble_substrate + boulder_substrate + 
##     undercut_bank + redd_total + redd_presence
##                     Df    AIC
## - boulder_substrate  2 4556.0
## - velocity           2 4556.5
## - cobble_substrate   2 4557.7
## <none>                 4558.2
## - large_woody        2 4558.3
## - undercut_bank      2 4561.7
## - redd_presence      2 4562.1
## - aquatic_veg        2 4563.1
## - small_woody        2 4565.5
## - overhanging_veg    2 4575.1
## - depth              2 4593.3
## - redd_total         2 4671.5
## 
## Step:  AIC=4556.02
## count ~ small_woody + depth + velocity + large_woody + aquatic_veg + 
##     overhanging_veg + cobble_substrate + undercut_bank + redd_total + 
##     redd_presence
##                    Df    AIC
## - velocity          2 4554.5
## - large_woody       2 4556.0
## - cobble_substrate  2 4556.0
## <none>                4556.0
## - undercut_bank     2 4559.4
## - redd_presence     2 4560.0
## - aquatic_veg       2 4560.8
## - small_woody       2 4563.5
## - overhanging_veg   2 4572.2
## - depth             2 4596.0
## - redd_total        2 4671.9
## 
## Step:  AIC=4554.5
## count ~ small_woody + depth + large_woody + aquatic_veg + overhanging_veg + 
##     cobble_substrate + undercut_bank + redd_total + redd_presence
## 
##                    Df    AIC
## - cobble_substrate  2 4554.1
## <none>                4554.5
## - large_woody       2 4554.6
## - undercut_bank     2 4557.9
## - aquatic_veg       2 4558.1
## - redd_presence     2 4558.1
## - small_woody       2 4563.5
## - overhanging_veg   2 4573.3
## - depth             2 4591.1
## - redd_total        2 4660.9
## 
## Step:  AIC=4554.14
## count ~ small_woody + depth + large_woody + aquatic_veg + overhanging_veg + 
##     undercut_bank + redd_total + redd_presence
## 
##                   Df    AIC
## - large_woody      2 4553.9
## <none>               4554.1
## - undercut_bank    2 4557.6
## - aquatic_veg      2 4557.7
## - redd_presence    2 4558.1
## - small_woody      2 4563.6
## - overhanging_veg  2 4573.3
## - depth            2 4588.0
## - redd_total       2 4661.2
## 
## Step:  AIC=4553.9
## count ~ small_woody + depth + aquatic_veg + overhanging_veg + 
##     undercut_bank + redd_total + redd_presence
## 
##                   Df    AIC
## <none>               4553.9
## - redd_presence    2 4557.6
## - undercut_bank    2 4557.7
## - aquatic_veg      2 4558.2
## - small_woody      2 4563.9
## - overhanging_veg  2 4574.3
## - depth            2 4587.7
## - redd_total       2 4662.3
## 
## Call:
## pscl::hurdle(formula = count ~ small_woody + depth + aquatic_veg + overhanging_veg + 
##     undercut_bank + redd_total + redd_presence, data = model_data, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42135 -0.08368 -0.06422 -0.04927 55.14739 
## 
## Count model coefficients (truncated negbin with log link):
##                   Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)      3.2266901  0.5700311   5.661 0.0000000151 ***
## small_woody      0.3506376  0.4543316   0.772      0.44025    
## depth            0.0233575  0.0064117   3.643      0.00027 ***
## aquatic_veg      0.3616881  0.4211060   0.859      0.39040    
## overhanging_veg  0.7163888  0.3758749   1.906      0.05666 .  
## undercut_bank   -0.6673778  0.7513705  -0.888      0.37443    
## redd_total      -0.0001439  0.0001409  -1.021      0.30727    
## redd_presence   -0.9326337  0.5374702  -1.735      0.08270 .  
## Log(theta)      -2.1880747  0.4568000  -4.790 0.0000016678 ***
## Zero hurdle model coefficients (binomial with logit link):
##                    Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)     -4.63922693  0.20161205 -23.011 < 0.0000000000000002 ***
## small_woody      0.67250697  0.17670229   3.806             0.000141 ***
## depth            0.01127768  0.00224564   5.022          0.000000511 ***
## aquatic_veg     -0.47521715  0.17901151  -2.655             0.007939 ** 
## overhanging_veg  0.70331579  0.15018862   4.683          0.000002829 ***
## undercut_bank    1.18860967  0.41269956   2.880             0.003976 ** 
## redd_total       0.00077913  0.00007016  11.105 < 0.0000000000000002 ***
## redd_presence    0.41664604  0.21517245   1.936             0.052827 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 0.1121
## Number of iterations in BFGS optimization: 32 
## Log-likelihood: -2260 on 17 Df

Hurdle Model Performance

Evaluate the presence / absence component (classification)

## Area under the curve: 0.7378
##          Observed
## Predicted    0    1
##         0 1819   31
##         1 5846  239

An AUC of 0.74 indicates that the habitat variables have some, though modest, ability to discriminate presence from absence. However, the confusion matrix shows that predicted presences are rarely correct (239 true positives against 5,846 false positives), which is likely due to the dataset being so heavily skewed towards absence. Because non-zero counts were few and weakly explained, abundance predictions were unreliable.

Evaluate the count component (abundance, conditional on presence)

Only evaluate sites where count > 0.

##          rmse           mae            r2 
## 187.772806751  69.076957892   0.006929759

The count component of the hurdle model performed very poorly (RMSE = 187.8, MAE = 69.1, R² = 0.007), indicating that the model explained virtually none of the variation in non-zero fish counts. Given the structure of the dataset, characterized by a high frequency of absences and occasional extremely large count values, this lack of performance is not unexpected, as such conditions make it difficult for the count component of a hurdle model to reliably capture abundance patterns.
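The three error metrics can be sketched in Python as follows. The observed and predicted counts are hypothetical, and computing R² as the squared Pearson correlation is an assumption, since the report does not state which definition was used.

```python
import math


def count_metrics(observed, predicted):
    """RMSE, MAE, and R² (squared Pearson correlation; an assumed definition)."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov ** 2 / (var_o * var_p)

    return {"rmse": rmse, "mae": mae, "r2": r2}


# Hypothetical non-zero counts: mostly small values plus one very large school.
observed = [1, 2, 3, 25, 400]
predicted = [10, 12, 9, 15, 20]

metrics = count_metrics(observed, predicted)
```

In this toy example a single large count accounts for almost all of the squared error, which mirrors how occasional extreme counts can dominate RMSE in a dataset like this one.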

Hurdle Model Discussion

The hurdle model showed reasonable performance for the presence–absence component (AUC = 0.74), indicating that the predictors captured meaningful structure in fish occurrence. However, the count component exhibited poor predictive skill, likely due to the distribution of fish counts, which was dominated by zeros and characterized by extremely high variability among non-zero observations (median = 0, mean = 2.8, maximum = 1500). This combination of many low counts and occasional extreme values violates the assumptions of a simple Poisson or truncated count process and limits the model’s ability to reliably predict abundance. As a result, we elected to focus subsequent analyses on habitat associations for fish occurrence using logistic regression, which is more consistent with the information content and statistical properties of the data.

Logistic Regression Approach

Because fish counts were highly zero-inflated and exhibited extreme variability among non-zero values, abundance models performed poorly. We therefore focused on modeling fish occurrence using logistic regression, which better matches the information content of the data and provides more reliable inference on habitat associations.

Build Logistic Regression Model

A simple logistic regression to start.

## 
## Call:
## glm(formula = presence ~ small_woody + depth + velocity + large_woody + 
##     aquatic_veg + overhanging_veg + cobble_substrate + boulder_substrate + 
##     undercut_bank + redd_total + redd_presence, family = binomial(link = "logit"), 
##     data = log_reg_data)
## 
## Coefficients:
##                      Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)       -4.53792574  0.20812158 -21.804 < 0.0000000000000002 ***
## small_woody        0.61225724  0.17909727   3.419              0.00063 ***
## depth              0.01113125  0.00226802   4.908          0.000000921 ***
## velocity          -0.14217267  0.09550485  -1.489              0.13658    
## large_woody        0.61186260  0.41866685   1.461              0.14389    
## aquatic_veg       -0.50784185  0.18164690  -2.796              0.00518 ** 
## overhanging_veg    0.64137590  0.15518691   4.133          0.000035818 ***
## cobble_substrate  -0.21626027  0.15461736  -1.399              0.16191    
## boulder_substrate  0.29143190  0.22925806   1.271              0.20366    
## undercut_bank      1.15456841  0.42015448   2.748              0.00600 ** 
## redd_total         0.00077947  0.00007304  10.671 < 0.0000000000000002 ***
## redd_presence      0.46224605  0.21689935   2.131              0.03308 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2356.2  on 7934  degrees of freedom
## Residual deviance: 2043.7  on 7923  degrees of freedom
## AIC: 2067.7
## 
## Number of Fisher Scoring iterations: 7

Performance

## Area under the curve: 0.7847

An AUC of 0.78 means the habitat variables explain some presence–absence structure, but there is still a lot of overlap/noise in the data.

Confusion Matrix:

##          Observed
## Predicted    0    1
##         0 7663  268
##         1    2    2

Logistic Regression with Random Effect

Because fish occurrence varied among transect sites and between high-flow and low-flow channels, we adopted a mixed-effects logistic regression framework to account for this spatial heterogeneity. The model estimates the probability of fish presence as a function of local habitat features and spawning context, while allowing baseline occurrence to vary among channel types and individual transect sites through random effects.

We first fit a model with a random intercept for sampling location and then extended this structure by nesting location within channel (high flow or low flow). The nested model did not improve on the location-only model: the estimated variance for channel was effectively zero, AUC was unchanged (0.8215), AIC increased slightly (from 1995.7 to 1997.7), and classification outcomes at a fixed threshold were identical.

Random Effect of Location

Using the sampling location (29 sites) as the random effect:

##  Family: binomial  ( logit )
## Formula:          
## presence ~ small_woody + depth + velocity + large_woody + aquatic_veg +  
##     overhanging_veg + cobble_substrate + boulder_substrate +  
##     undercut_bank + redd_total + redd_presence + (1 | location)
## Data: log_reg_data
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##    1995.7    2086.5    -984.9    1969.7      7922 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  location (Intercept) 0.4547   0.6743  
## Number of obs: 7935, groups:  location, 29
## 
## Conditional model:
##                     Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)       -4.9857062  0.3382477 -14.740 < 0.0000000000000002 ***
## small_woody        0.5918062  0.1805585   3.278             0.001047 ** 
## depth              0.0158105  0.0023712   6.668       0.000000000026 ***
## velocity          -0.3293563  0.1082604  -3.042             0.002348 ** 
## large_woody        0.3714054  0.4271962   0.869             0.384627    
## aquatic_veg       -0.4444705  0.1851405  -2.401             0.016363 *  
## overhanging_veg    0.6947510  0.1586632   4.379       0.000011934676 ***
## cobble_substrate   0.1276086  0.1720163   0.742             0.458184    
## boulder_substrate  0.4799388  0.2434783   1.971             0.048704 *  
## undercut_bank      1.0874731  0.4517521   2.407             0.016074 *  
## redd_total         0.0007826  0.0002287   3.422             0.000621 ***
## redd_presence      0.5356947  0.3907809   1.371             0.170428    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Area under the curve: 0.8215
##          Observed
## Predicted    0    1
##         0 7662  257
##         1    3   13

Random Effect of Channel Location | Location

Nesting the transect location within the high flow/low flow channel as the random effect:

##  Family: binomial  ( logit )
## Formula:          
## presence ~ small_woody + depth + velocity + large_woody + aquatic_veg +  
##     overhanging_veg + cobble_substrate + boulder_substrate +  
##     undercut_bank + redd_total + redd_presence + (1 | channel_location/location)
## Data: log_reg_data
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##    1997.7    2095.4    -984.9    1969.7      7921 
## 
## Random effects:
## 
## Conditional model:
##  Groups                    Name        Variance      Std.Dev.  
##  location:channel_location (Intercept) 0.45471325028 0.67432429
##  channel_location          (Intercept) 0.00000000995 0.00009975
## Number of obs: 7935, groups:  
## location:channel_location, 29; channel_location, 2
## 
## Conditional model:
##                     Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)       -4.9857571  0.3382526 -14.740 < 0.0000000000000002 ***
## small_woody        0.5918091  0.1805586   3.278              0.00105 ** 
## depth              0.0158105  0.0023712   6.668       0.000000000026 ***
## velocity          -0.3293483  0.1082603  -3.042              0.00235 ** 
## large_woody        0.3714329  0.4271945   0.869              0.38459    
## aquatic_veg       -0.4444623  0.1851406  -2.401              0.01636 *  
## overhanging_veg    0.6947498  0.1586634   4.379       0.000011935374 ***
## cobble_substrate   0.1276046  0.1720166   0.742              0.45820    
## boulder_substrate  0.4799272  0.2434791   1.971              0.04871 *  
## undercut_bank      1.0874645  0.4517523   2.407              0.01607 *  
## redd_total         0.0007826  0.0002287   3.423              0.00062 ***
## redd_presence      0.5357192  0.3907818   1.371              0.17041    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Area under the curve: 0.8215
##          Observed
## Predicted    0    1
##         0 7662  257
##         1    3   13

Random Effect of Location and Month

Using sampling location and month as random effects.

Including month as a random effect substantially improved model performance, increasing AUC from approximately 0.82 to 0.91. This indicates that accounting for seasonal variation greatly enhanced the model’s ability to discriminate between locations with and without fish presence. Despite this improvement, confusion matrices evaluated at a fixed probability threshold showed similar classification outcomes across models. This reflects the rarity of presence observations and the fact that many true presences receive predicted probabilities below the default classification cutoff. Thus, the month random-effects model improves probabilistic ranking and ecological realism without substantially altering threshold-based classifications.
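Written out, a model with random intercepts for location and month has the general form below. The notation is introduced here for illustration: \ell(i) and m(i) denote the location and month of observation i, and \mathbf{x}_i holds the habitat and redd covariates.

```latex
\begin{aligned}
\operatorname{logit} P(\text{presence}_i = 1)
  &= \beta_0 + \mathbf{x}_i^\top \boldsymbol{\beta} + u_{\ell(i)} + v_{m(i)}, \\
u_{\ell} &\sim \mathcal{N}\!\left(0,\ \sigma^2_{\text{location}}\right), \qquad
v_{m} \sim \mathcal{N}\!\left(0,\ \sigma^2_{\text{month}}\right)
\end{aligned}
```

In the fitted model summarized in the output that follows, the estimated month variance (3.0424) is much larger than the location variance (0.4867), consistent with seasonal timing accounting for more of the variation in baseline occurrence than differences among sites.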

##  Family: binomial  ( logit )
## Formula:          
## presence ~ small_woody + depth + velocity + large_woody + aquatic_veg +  
##     overhanging_veg + cobble_substrate + boulder_substrate +  
##     undercut_bank + redd_total + redd_presence + (1 | location) +  
##     (1 | month)
## Data: log_reg_data
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##    1728.3    1826.0    -850.2    1700.3      7921 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  location (Intercept) 0.4867   0.6977  
##  month    (Intercept) 3.0424   1.7442  
## Number of obs: 7935, groups:  location, 29; month, 6
## 
## Conditional model:
##                     Estimate Std. Error z value          Pr(>|z|)    
## (Intercept)       -5.9283395  0.8135529  -7.287 0.000000000000317 ***
## small_woody        0.2674058  0.1919898   1.393          0.163676    
## depth              0.0145450  0.0024125   6.029 0.000000001650811 ***
## velocity          -0.6514383  0.1327581  -4.907 0.000000924998595 ***
## large_woody       -0.2250985  0.4770764  -0.472          0.637049    
## aquatic_veg       -0.4599767  0.1939002  -2.372          0.017681 *  
## overhanging_veg    0.7841664  0.1680981   4.665 0.000003087183362 ***
## cobble_substrate  -0.2686053  0.1816267  -1.479          0.139171    
## boulder_substrate  0.4048040  0.2473029   1.637          0.101657    
## undercut_bank      0.5586185  0.5259972   1.062          0.288227    
## redd_total         0.0008865  0.0002392   3.706          0.000211 ***
## redd_presence      0.7290142  0.4150187   1.757          0.078989 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Area under the curve: 0.9072
##          Observed
## Predicted    0    1
##         0 7649  232
##         1   16   38
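
The threshold-based comparison can be made concrete by turning the two confusion matrices into rates. The following is a minimal sketch in Python rather than the R workflow that produced the output. The helper name `classification_metrics` is hypothetical, and the counts are copied from the matrices reported for the earlier model (AUC 0.8215) and the location + month model (AUC 0.9072).

```python
def classification_metrics(tn, fn, fp, tp):
    """Summarize a 2x2 confusion matrix (counts) as rates."""
    return {
        "sensitivity": tp / (tp + fn),  # share of observed presences predicted present
        "specificity": tn / (tn + fp),  # share of observed absences predicted absent
        "precision": tp / (tp + fp),    # share of predicted presences that were observed
    }

# Counts copied from the reported matrices (rows = predicted, columns = observed).
earlier_model = classification_metrics(tn=7662, fn=257, fp=3, tp=13)  # AUC 0.8215
month_model = classification_metrics(tn=7649, fn=232, fp=16, tp=38)   # AUC 0.9072

for label, metrics in [("earlier model", earlier_model),
                       ("location + month", month_model)]:
    summary = ", ".join(f"{name}={value:.3f}" for name, value in metrics.items())
    print(f"{label}: {summary}")
```

Sensitivity rises from about 0.05 to about 0.14 while specificity stays above 0.99, which is consistent with the interpretation that the gain in AUC comes mainly from better ranking of predicted probabilities rather than from more presences crossing the cutoff.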

Figures

The following figure shows the estimated effect of each month on fish presence, expressed as an odds ratio relative to the overall mean across months. Points greater than 1 represent higher odds of fish presence and points less than 1 represent lower odds. Wider bars represent greater uncertainty.

The following figure shows the estimated effect of each site on fish presence, expressed as an odds ratio relative to the overall mean across sites. Points greater than 1 represent higher odds of fish presence and points less than 1 represent lower odds. Wider bars represent greater uncertainty.

The following figure shows the effect of each predictor on fish presence, expressed as an odds ratio with all other predictors held constant. Points greater than 1 represent higher odds of fish presence and points less than 1 represent lower odds. Wider bars represent greater uncertainty.

Although depth and total number of redds have odds ratios close to one, both predictors are statistically significant, indicating small but consistent effects on fish presence. The per-unit magnitude of these effects is modest, and their statistical significance reflects the precision of the estimates and the large sample size. A per-unit odds ratio close to one does not imply a lack of biological relevance, because the effect compounds multiplicatively over larger changes in depth or redd count.
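
To illustrate how a per-unit odds ratio near one can still add up, the sketch below exponentiates the reported log-odds estimates for depth and total redds over several changes in the predictor. It is written in Python for illustration only. The change sizes (10 or 50 depth units, 100 or 500 redds) are hypothetical values chosen for the example, not quantities taken from the data, and the function name `odds_multiplier` is an assumption.

```python
import math

# Fixed-effect estimates (log-odds per unit) from the location + month model output.
BETA_DEPTH = 0.0145450
BETA_REDD_TOTAL = 0.0008865

def odds_multiplier(beta, change):
    """Multiplicative change in odds for a given change in the predictor."""
    return math.exp(beta * change)

# Hypothetical changes, chosen only to show how the effect compounds.
depth_effects = {change: odds_multiplier(BETA_DEPTH, change) for change in (1, 10, 50)}
redd_effects = {change: odds_multiplier(BETA_REDD_TOTAL, change) for change in (1, 100, 500)}

for change, mult in depth_effects.items():
    print(f"depth +{change}: odds x {mult:.3f}")
for change, mult in redd_effects.items():
    print(f"redd_total +{change}: odds x {mult:.3f}")
```

Under these assumed change sizes, a 50-unit increase in depth roughly doubles the odds of presence even though the one-unit odds ratio is only about 1.01.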

Logistic Regression with Random Effect Discussion

Chinook Salmon occurrence in the Feather River was shaped by a combination of local habitat features, spawning context, and strong spatial and seasonal structure. Incorporating random effects for both transect location and month substantially improved model performance, yielding high discriminatory ability (AUC = 0.91) and indicating that fish presence is influenced not only by measured habitat covariates but also by broader spatial and temporal processes.

Seasonal and Spatial Structure

The inclusion of month as a random effect revealed pronounced seasonal variation in baseline salmon occurrence. Month-level random intercepts varied widely, with odds of presence in early spring months (March–April) approximately 4–5 times higher than the global mean, while late-summer months exhibited substantially reduced odds. This pattern is consistent with seasonal dynamics in salmonid behavior, including outmigration earlier in the sampling period, and highlights the importance of accounting for temporal structure when modeling fish presence.
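
The size of the seasonal effect can be read off the random-effect standard deviations in the model output. The sketch below, in Python for illustration, converts those standard deviations into odds multipliers and shows what a 4 to 5 times increase in odds does to the baseline probability. It assumes the baseline is the fixed intercept with all predictors at zero, which may not correspond to a realistic habitat unit, so the probabilities are illustrative only. The helper `inv_logit` is defined here and is not part of the original analysis.

```python
import math

# Values copied from the location + month model output.
INTERCEPT = -5.9283395   # fixed intercept on the log-odds scale
MONTH_SD = 1.7442        # std. dev. of month random intercepts
LOCATION_SD = 0.6977     # std. dev. of location random intercepts

def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Odds multiplier for a group whose intercept sits one SD above the mean.
sd_multipliers = {"month": math.exp(MONTH_SD), "location": math.exp(LOCATION_SD)}

# Log-odds offsets that correspond to 4x and 5x the odds at the global mean.
offsets = {factor: math.log(factor) for factor in (4, 5)}

# Assumption: baseline is the intercept alone (all predictors at zero).
baseline = inv_logit(INTERCEPT)
shifted = {factor: inv_logit(INTERCEPT + off) for factor, off in offsets.items()}

print(f"one-SD odds multipliers: {sd_multipliers}")
print(f"baseline probability: {baseline:.4f}")
for factor, prob in shifted.items():
    print(f"{factor}x odds -> probability {prob:.4f}")
```

Because presence is rare, even a fivefold increase in odds leaves the predicted probability low in absolute terms, which helps explain why few observations cross a fixed classification cutoff.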

Site-level random effects also indicated persistent spatial heterogeneity across transects. Several locations exhibited consistently lower baseline probabilities of fish presence (e.g., Vance Avenue and Big Bar), while others were closer to or above the global mean. These differences suggest that reach-scale or site-specific factors—such as connectivity, geomorphic context, or unmeasured hydraulic or thermal conditions—continue to influence fish distribution beyond the microhabitat variables included in the model.

Habitat Associations

After accounting for spatial and seasonal variability, several local habitat features showed strong and consistent associations with fish presence. Overhanging vegetation emerged as one of the most influential predictors, with sites containing overhanging vegetation exhibiting more than double the odds of fish presence relative to sites without it (odds ratio ≈ 2.19). This finding underscores the importance of riparian structure in providing cover, refuge, and potentially favorable thermal conditions.

Depth was positively associated with fish presence, although the per-unit effect was small (odds ratio ≈ 1.01). Despite its modest magnitude, this effect was highly significant. In contrast, velocity showed a strong negative association with presence (odds ratio ≈ 0.52 per unit increase), suggesting that fish were less likely to occupy faster-flowing units once other habitat features were accounted for.

Aquatic vegetation was negatively associated with fish presence, indicating that areas dominated by vegetation may provide less suitable habitat in this system, potentially due to reduced hydraulic complexity or limited refuge availability. Other structural features, including small woody debris, large woody debris, undercut banks, and substrate types (cobble and boulder), did not exhibit statistically significant effects in the final model once spatial and seasonal variation were accounted for. This does not imply that these features are unimportant, but rather that their effects may be context-dependent or partially captured by correlated habitat or site-level factors.
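
The habitat associations described above can be checked directly from the coefficient table by exponentiating each estimate and its Wald interval. The following Python sketch assumes approximate 95% Wald intervals (estimate ± 1.96 standard errors); the intervals drawn in the report's figures may have been computed differently. Estimates and standard errors are copied from the location + month model output, and the function name `odds_ratio_ci` is hypothetical.

```python
import math

# (estimate, std. error) on the log-odds scale, from the location + month model.
coefs = {
    "small_woody": (0.2674058, 0.1919898),
    "depth": (0.0145450, 0.0024125),
    "velocity": (-0.6514383, 0.1327581),
    "large_woody": (-0.2250985, 0.4770764),
    "aquatic_veg": (-0.4599767, 0.1939002),
    "overhanging_veg": (0.7841664, 0.1680981),
    "cobble_substrate": (-0.2686053, 0.1816267),
    "boulder_substrate": (0.4048040, 0.2473029),
    "undercut_bank": (0.5586185, 0.5259972),
    "redd_total": (0.0008865, 0.0002392),
    "redd_presence": (0.7290142, 0.4150187),
}

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval

def odds_ratio_ci(estimate, std_error, z=Z_95):
    """Return (odds ratio, lower bound, upper bound) from a Wald interval."""
    return (math.exp(estimate),
            math.exp(estimate - z * std_error),
            math.exp(estimate + z * std_error))

odds_ratios = {name: odds_ratio_ci(est, se) for name, (est, se) in coefs.items()}

# Predictors whose interval excludes an odds ratio of 1.
excludes_one = {name for name, (_, lo, hi) in odds_ratios.items() if lo > 1 or hi < 1}

for name, (ratio, lo, hi) in odds_ratios.items():
    flag = "excludes 1" if name in excludes_one else "includes 1"
    print(f"{name:18s} OR={ratio:.3f}  95% CI=({lo:.3f}, {hi:.3f})  {flag}")
```

Under the Wald assumption, the intervals for depth, velocity, aquatic vegetation, overhanging vegetation, and total redds exclude 1, matching the predictors flagged as significant at the 0.05 level in the model output.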

Spawning Context

Spawning activity was positively associated with fish occurrence. The total number of nearby redds showed a significant positive relationship with fish presence, indicating that areas with greater spawning activity were more likely to support fish during snorkel surveys. The binary indicator of redd presence showed a positive but non-significant effect (p ≈ 0.08), suggesting that the intensity of spawning activity, rather than simple presence or absence of redds, may be more informative for predicting fish occurrence.

Synthesis and Implications

Together, these results indicate that fish distribution in the study system is governed by an interplay between fine-scale habitat conditions and broader spatial and temporal processes. Structural habitat features—particularly riparian cover and depth—play important roles in shaping local occurrence, while strong seasonal effects and persistent site-level differences underscore the influence of processes operating at larger spatial and temporal scales. The strong performance of the mixed-effects model highlights the value of accounting for repeated sampling, seasonal dynamics, and spatial heterogeneity when modeling fish occurrence in riverine systems.